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linear models 
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Abstract 

Ordinary least squares (OLS) is the default method for fitting linear models, but is 
not applicable for problems with dimensionality larger than the sample size. For these 
problems, we advocate the use of a generalized version of OLS motivated by ridge 
regression, and propose two novel three-step algorithms involving least squares fitting 
and hard thresholding. The algorithms are methodologically simple to understand 
intuitively, computationally easy to implement efficiently, and theoretically appealing 
for choosing models consistently. Numerical exercises comparing our methods with 
penalization-based approaches in simulations and data analyses illustrate the great 
potential of the proposed algorithms. 


1 Introduction 


Long known for its consistency, simplicity and optimality under mild conditions, ordinary 
least squares (OLS) is the most widely used technique for fitting linear models. Developed 
originally for htting hxed dimensional linear models, unfortunately, classical OLS fails in 
high dimensional linear models where the number of predictors p far exceeds the number 
of observations n. To deal with this problem, Tibshirani ( |1996 ) proposed £i-penahzed re¬ 
gression, a.k.a, lasso, which triggered the recent overwhelming exploration in both theory 
and methodology of penalization-based methods. These methods usually assume that only 
a small number of coefficients are nonzero (known as the sparsity assumption), and mini¬ 
mize the same least squares loss function as OLS by including an additional penalty on the 
coefficients, with the typical choice being the ii norm. Such “penalization” constrains the 
solution space to certain directions favoring sparsity of the solution, and thus overcomes the 
non-unique issue with OLS. It yields a sparse solution and achieves model selection consis¬ 


tency and estimation consistency under certain conditions. See Zhao and Yu (2006); Fan 


and Li (2001); Zhang (2010); Zou and Hastie (2005) 


1 



















Despite the success of the methods based on regularization, there are important issues 
that can not be easily neglected. On the one hand, methods using convex penalties, such 


as lasso, usually require strong conditions for model selection consistency (Zhao and Yu 


2006 Lounici, 2008). On the other hand, methods using non-convex penalties (Fan and 


Li, 2001| Zhang, 2010) that can achieve model selection consistency under mild conditions 
often require huge computational expense. These concerns have limited the practical use of 


regularized methods, motivating alternative strategies such as direct hard thresholding (Jain 


et al., 2014). 


In this article, we aim to solve the problem of htting high-dimensional sparse linear mod¬ 
els by reconsidering OLS and answering the following simple question: Can ordinary least 
squares consistently £t these models with some suitable algorithms? Our result provides an 
affirmative answer to this question under fairly general settings. In particular, we give a 
generalized form of OLS in high dimensional linear regression, and develop two algorithms 
that can consistently estimate the coefficients and recover the support. These algorithms 
involve least squares type of htting and hard thresholding, and are non-iterative in nature. 
Extensive empirical experiments are provided in Section 4 to compare the proposed esti¬ 
mators to many existing penalization methods. The performance of the new estimators is 
very competitive under various setups in terms of model selection, parameter estimation and 
computational time. 


1.1 Related Works 


The work that is most closely related to ours is Yang et al. (2014), in which the authors 


proposed an algorithm based on OLS and ridge regression. However, both their methodology 
and theory are still within the ii regularization framework, and their conditions (especially 
their C-Ridge and C-OLS conditions) are overly strong and can be easily violated in practice. 


Jain et al. (2014) proposed an iterative hard thresholding algorithm for sparse regression. 


which shares a similar spirit of hard thresholding as our algorithm. Nevertheless, their 
motivation is completely different, their algorithm lacks theoretical guarantees for consistent 
support recovery, and they require an iterative estimation procedure. 


1.2 Our Contributions 

We provide a generalized form of OLS for htting high dimensional data motivated by ridge 
regression, and develop two algorithms that can consistently ht linear models on weakly 
sparse coefficients. We summarize the advantages of our new algorithms in three points. 
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1. Our algorithms work for highly correlated features under random designs. The consis¬ 
tency of the algorithms relies on a moderately growing conditional number, as opposed 


to the strong ir represent able condition (Zhao and Yu, 2006 Wainwright, 2009) required 
by lasso. 

2. Our algorithms can achieve consistent identify strong signals for ultra-high dimensional 
data (logp = o{n)) with only a bounded variance assumption on the noise e, i.e., 


var{e) < oo. This is remarkable as most methods (c.f. 

Zhang 

(2010 

); 

Yang et ah 

(2014); |Cai and Wang 

(2011 

); 

Wainwright (2009 

); Zhang and Huang (2008); Wang 


and Leng (2015)) that work for logp = o{n) case rely on a sub-Gaussian tail/bounded 


error assumption, which might fail to hold for real data. Lounici ( 2008[ ) proved that 
lasso also achieves consistent model selection with a second-order condition similar to 
ours, but requires two additional assumptions. 


3. The algorithms are simple, efficient and scale well for large p. In particular, the matrix 
operations are fully parallelizable with very few communications for very large p, while 
regularization methods are either hard to be computed in parallel in the feature space, 
or the parallelization requires a large amount of machine communications. 


The remainder of this article is organized as follows. In Section 2 we generalize the 
ordinary least squares estimator for high dimensional problems where p > n, and propose 
two three-step algorithms consisting only of least squares htting and hard thresholding in a 
loose sense. Section 3 provides consistency theory for the algorithms. Section 4 evaluates 
the empirical performance. We conclude and discuss further implications of our algorithms 
in the last section. All the proofs are provided in the supplementary materials. 


2 High dimensional ordinary least squares 


Consider the usual linear model 


Y = Xl3 + e, 

where X is the n x p design matrix, Y is the n x 1 response vector and (3 is the coefficient. 
In the high dimensional literature, /dj’s are routinely assumed to be zero except for a small 
subset S* = supp{(3). In this paper, we consider a slightly more general setting, where f3 is 
not exactly sparse, but consists of both strong and weak signals. In particular, we dehned 


3 












































S* and S, 


S* = {k : |/3fc| > r*} S, = {k : |/?fc| < r,} 

as the strong and weak signal sets and S'* U S'* = {1, 2, ■ ■ ■ ,p}. The algorithms developed in 
this paper is to recover the strong signal set S*. The specific relationship between r* and r* 
will be detailed later. 

To carefully tailor the low-dimensional OLS estimator in a high dimensional scenario, one 
needs to answer the following two questions: i) What is the correct form of OLS in the high 
dimensional setting? ii) How to correctly use this estimator? To answer these, we reconsider 
OLS from a different perspective by viewing the OLS as the limit of the ridge estimator with 
the ridge parameter going to zero, i.e., 

{X'^Xy^X^Y = lim {X^X + rL)-^X'^Y. 

r^O 

One nice property of the ridge estimator is that it exists regardless of the relationship between 
p and n. A keen observation reveals the following relationship immediately. 

Lemma 1. For any p,n,r > 0, we have 

{X^X + rIp)-^X^Y = X'^iXX^ + rIn)~^Y. ( 1 ) 

Notice that the right hand side of Q exists when p > n and r = 0. Consequently, we 
can naturally extend the classical OLS to the high dimensional scenario by letting r tend to 
zero in ([^. Denote this high dimensional version of the OLS as 

= lim + rIn)~^Y = X^{XX^)-^Y. 

r^O 

The above equation indicates that is essentially an orthogonal projection of /? onto 

the row space of X. Unfortunately, this (low dimensional) projection does not have good 
general performance in estimating sparse vectors in high-dimensional cases. Instead of di¬ 
rectly estimating /? as however, this new estimator of (3 may be used for dimension 

reduction by observing = X'^{XX'^)~^X[3 + X'^{XX'^)~^e = <h/S -1- p. Since p is 

stochastically small, if <I> is close to a diagonally dominant matrix and f3 is sparse, then the 
strong and weak signals can be separated by simply thresholding the small entries of 
The exact meaning of this statement will be discussed in the next section. Some simple ex¬ 
amples demonstrating the diagonal dominance of are illustrated immediately 

in Figure where the rows of X in the left two plots are drawn from A^(0, S) with cxjj = 0.6 
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Figure 1: Examples for X^(XX^) ^X. Left: X ~ A^(0,S) with aij = 0.6 and an = 1; 
Middle: X ~ A^(0, E) with aij = Right: Real data from Section 4. 


or aij = 0.99l*“-^L The sample size and data dimension are chosen as {n,p) = (50,1000). The 
right plot takes the standardized design matrix directly from the real data in Section 4 with 
{n,p) = (395,767). A clear diagonal dominance pattern is visible in each plot. 

This ability to separate strong and weak signals allows us to hrst obtain a smaller model 
with size d such that \S*\ < d < n containing S*. Since d is below n, one can directly apply 
the usual OLS to obtain an estimator, which will be thresholded further to obtain a more 
rehned model. The hnal estimator will then be obtained by an OLS ht on the rehned model. 
This three-stage non-iterative algorithm is termed Least-squares adaptive thresholding (LAT) 
and the concrete procedure is described in Algorithm [T] 


Algorithm 1 The Least-squares Adaptive Thresholding (LAT) Algorithm 

Initialization: 

1: Input (y, X), d, 5 
Stage 1 : Pre-selection 

2: Standardize Y and X to R and X having mean 0 and variance 1. 

3 : Compute = X^(XX^ -|- 0.1 ■ In)~^Y . Rank the importance of the variables by 

I A'"”! 

4 : Denote the model corresponding to the d largest as Alternatively use ex¬ 

tended BIG 
to select the best submodel. 

Stage 2 : Hard thresholding 
5 : = (XLXj^J-^XLY; 

6 : 

^ 

8: Threshold by MEAN{\/2d‘^Cii log(4d/d)) or use BIG to select the best submodel. 

Denote the chosen model as Ad. 

Stage 3 : Rehnement 

10: y = 0,Vi ^ Ad; 

11: return (3. 


(Chen and Chen, 2008) in conjunction with the obtained variable importance 
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The input parameter d is the submodel size selected in Stage 1 and 6 is the tuning 
parameter determining the threshold in Stage 2; these values will be specihed in Section 
4. In Stage 1, we use + 0.1 ■ In)~^Y instead of = X^{XX^)-^Y 

because XX^ is rank dehcient (the rank is n — 1) after standardization. The number 0.1 


can be replaced by any arbitrary small number. As noted in Wang and Leng (2015), this 
additional ridge term is also essential when p and n get closer, in which case the condition 
number of XX"^ increases dramatically, resulting in an explosion of the model noise. Our 
results in Section 3 mainly focus on = X"^{XX'^)~^Y where X is assumed to be drawn 

from a distribution with mean zero, so no standardization or ridge adjustment is required. 
However, the result is easy to generalize to the case where a ridge term is included. See 


Wang and Leng (2015). 


The Stage 1 of Algorithm is very similar to variable screening methods (Fan and Lv 


2008 Wang and Leng, 2015). However, most screening methods require a sub-Gaussian 


condition the noise to handle the ultra-high dimensional data where log(p) = o{n). In 
contrast to the existing theory, we prove in the next section a better result that Stage 1 of 
Algorithm [T] can produce satisfactory submodel even for heavy-tailed noise. 

The estimator in Stage 2 can be substituted by its ridge counterpart _ 

(AL Xj^^ -|- rI(i)~^X'^Y and C by {X'^^Xj^^ + rld)~^ to stabilize numerical computation. 


Similar modihcation can be applied to the Stage 3 as well. The resulted variant of the 
algorithm is referred to as the Ridge Adaptive Thresholding (RAT) algorithm and described 
in Algorithm 

We suggest to use 10-fold cross-validation to tune the ridge parameter r. Notice that the 
model is already small after stage 1, so using cross-validation will not signihcantly increase 
the computational burden. The computational performance is illustrated in Section 4. 


3 Theory 


In this section, we prove the consistency of Algorithm [T] in identifying strong signals and 
provide concrete forms for all the values needed for the algorithm to work. Recall the linear 
model Y = Xjd + e. We consider the random design where the rows of X are drawn from 
an elliptical distribution with a density of g{xfT~^Xi) for some nonnegative function g and 
positive dehnite S. It is easy to show that Xi admits an equivalent representation as 



Li 


yp^i ^1/2 
lki||2 


ypLj 




( 2 ) 
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Algorithm 2 The Ridge Adaptive Thresholding (RAT) Algorithm 


Initialization: 

1 : Input {Y, X), d, 5,r 
Stage 1 : Pre-selection 


Standardize Y and X to F and X having mean 0 and variance 1. 

Compute = X'^{XX'^ -|- 0.1 ■ In)~^Y. Rank the importance of the variables by 

Denote the model corresponding to the d largest as Aid- Alternatively use eBIC 

in Chen and Chen (2008) in conjunction with the obtained variable importance to select 


the best submodel. 

Stage 2 : Hard thresholding 

5 

6 
7 


^ (XTX^^ + 


xi-A 

Md ’ 


Threshold by MEAN('\/25^Chlog(4d7^) or use BIC to select the best submodel. 

Denote the chosen model as Ai. 

Stage 3 : Rehnement 

9 ^ Pm = 

10 : j3i = 0 , Vi ^ Ai] 

11 : return (3. 




where Zi is a p-variate standard Gaussian random variable and L* is a nonnegative random 
variable that is independent of Zi. We denote this distribution by EN{L, E). This random 
design allows for various correlation structures among predictors and contains many distribu¬ 
tion families that are widely used to illustrate methods that rely on the restricted eigenvalue 


conditions (Bickel et al., 2009; Raskutti et ah, 2010). The noise e, as mentioned earlier, is 
only assumed to have the second-order moment, i.e., var{e) = cr^ < oo, in contrast to the 


sub-Gaussian/bounded error assumption seen in most high dimension literature. See Zhang 


(2010); Yang et al. (2014); Gai and Wang (2011); Wainwright (2009); Zhang and Huang 


(2008). This relaxation is similar to Lounici (2008); however we do not require any further 
assumptions needed by Lounici (2008). In Algorithm]^ we also propose to use extended 
BIG and BIG for parameter tuning. However, the corresponding details will not be pursued 
here, as their consistency is straightforwardly implied by the results from this section and 


the existing literature on extended BIG and BIG (Ghen and Ghen, 2008). 


As shown in the variable L controls the signal strength of Xi, we thus need a lower 
bound on Lj to guarantee a good signal strength. Dehne k = condiY). We state our result 
in three theorems. 


Theorem 1. Assume Xi ~ EN{Li,T) with E[LE] < Mi and Si is a random variable with 
a bounded variance We also assume p > cqu for some cq > 1 and var{Y) < Mq. If 
[S'*! logp = o{n), n > 4co/(co — 1 )^, and r* > Ak^, then we can choose 7 to be 
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where ci is some absolute constant specified in Lemma^ and for any a E (0,1) we have 


P ( max I < 7 < min I ^ 


V *65h 


ies* 




{HD) 


= 1-0 


nr logp 
T*‘^n°‘ 


Theorem [T] guarantees the model selection consistency of the hrst stage of Algorithm 
It only requires a second-moment condition on the noise tail, relaxing the sub-Gaussian 
assumption seen in other literature. The probability term shows that the algorithm requires 
the important signals to be lower bounded by a signal strength of with a positive a. 

In addition, a gap of r*/r* > is needed between the strong signals and the weak signals 
in order for a successful support recovery. 

As 7 is not easily computable based on data, we propose to rank the and select 

d largest coefficients. Alternatively, we can construct a series of nested models formed by 


ranking the largest n coefficients and adopt the extended BIG (Ghen and Ghen, 2008) to 


select the best submodel. Once the submodel Md is obtained, we proceed to the second 
stage by obtaining an estimate via ordinary least squares corresponding to M.d- The 

theory for requires more stringent conditions, as we now need to estimate instead 

of just a correct ranking. In particular, we have to impose conditions on the magnitude of 
fis, and the moments of L, i.e., for we have the following result. 

Theorem 2. Assume the same conditions for X and e as in Theorem We also assume 
n > QAKd\ogp andd-\S*\ < h for some c> 0. If E[L-^^] < Mi, E[M^] < M 2 , n < 
and there exists some l E (0,1) such that 1^*1^ — > 0, we have 


p( max 

V \M\<d, S*(lM 


< 2a 


logp \ 

n" / 


-1 I M 1 + M 2 ^ {Mi + M2)R^ 


77.3 


(1-a) 


n 


|(l-4a) (logp)^‘n^ 


-4a-2t / ’ 


i.e., if T* > Scr-y/^^, then we can choose 7 ' = Suy/^^ 


and 


max < 7 ' < min 


with probability tending to 1. 


The moment condition on L is not tight. We used this number just for simplicity. As 
shown in Theorem]^ the norm of fis* is allowed to grow in the rate of 

i.e., our algorithm works for weakly sparse coefficients. However, Theorem imposes an 
upper bound on a while Theorem [T] not. This is mainly due to the moment assumption on 















L and the different strnctnre between and i.e., does not rely on L for 

diminishing the nnimportant signals. For ridge regression, we have the following result. 

Theorem 3 (Ridge regression). Assume all the conditions in Theorem^ If we choose the 
ridge parameter satisfying 

^ (J77,(7/9-5“/18) .y/log p 

^ - 162 kMo ’ 


then we have 


P 


max 

\M\<d,S*^M 


loo < 3(J 


logp \ 

n" / 


fXf'^dlogd 2 M 1 + M 2 (Mi + M 2 )R 3 \ 


i.e., %fT* > 7ctJ^ 


max < 7 ' < min 

with probability tending to 1. 

Note that the ridge parameter r can be chosen as a constant, bypassing the need to 
specify r at least in theory. When both the noise e and X follows Gaussian distribution and 
T* = 0, we can obtain a more explicit form of the threshold 7 ', as the following Corollary 
shows. 

Corollary 1 (Gaussian noise). Assume £ Ni0,a^), X iV(0, S) and r* = 0. For 

any 5 G (0, 1 ), define 7 ' = S\/2a, where a is the estimated standard error as 
= J2'i=iiyi ~ ~ d). For sufficiently large n, if d < n — 4iF^ log(2/(5)/c for some 

absolute constants c, K and t* > 2Aa, then with probability at least 1 — 26, we 
have 


|/3(0LS)| > y y. ^ <7' Vi ^ 

Write C = (X^^Xy^^)”^ as in Algorithm 1. In practice, we propose to use 7' = 
mean{\/2a‘^Cii \og{Ad/6)) as the threshold (see Algorithm [^, because the estimation er¬ 
ror takes a form of ^/a^^^ii\og{Ad[^. Alternatively, instead of identifying an explicit form 
of the threshold value (as is hard for general noise), one may also use BIG on nested models 
formed by ranking to search for the true model. Once the final model is obtained, 

as in Stage 3 of Algorithm 1 , we refit it again using ordinary least squares. The final output 
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will have the same output as if we knew the true model a priori with probability tending to 
1 , i.e., we have the following result. 


Theorem 4. Let A4 and jd he the final output from LAT or RAT. Assume all conditions in 


Theorem 
we have 


2 and 


5| Then with probability at least 1 - O ( 


M = Sf \\(ds^-Mi< 


2 ^ 2\S*\a^\ogp ,1,3 ,31, /,,_./logp 




and lift - /9||oo < 2a 




As implied by Theorem B-i LAT and RAT can consistently identify strong signals in 
the ultra-high dimensional (logp = o{n)) setting with only the bounded moment assumption 
var{e) < oo, in contrast to most existing methods that require e ~ iV(0,cT^) or ||e:||oo < oo. 


4 Experiments 


In this section, we provide extensive numerical experiments for assessing the performance of 
LAT and RAT. In particular, we compare the two methods to existing penalized methods 
including lasso, elastic net {enet (Zou and Hastie, 2005[ )), adaptive lasso (Zou, 2006), scad 
(Fan and Li, 2001) and mc+ ( |Zhang , [2010 ). As it is well-known that the lasso estimator is 
biased, we also consider two variations of it by combining lasso with Stage 2 and 3 of our 
LAT and RAT algorithms, denoted as lasLAT {Iasi in Figures) and lasRAT {las2 in Figures) 
respectively. We note that the lasLat algorithm is very similar to the thresholded lasso 


(Zhou, 2010) with an additional thresholding step. We code LAT and RAT and adaptive 


lasso in Matlab, use glmnet (Friedman et ah, 2010) for enet and lasso, and SparseReg (]Zhou 


et al., 2012 Zhou and Lange, 2013) for scad and mc+. Since adaptive lasso achieves a similar 


performance as lasLat on synthetic datasets, we only report its performance for the real data. 


4.1 Synthetic Datasets 

The model used in this section for comparison is the linear model Y = Xf3 + e, where 
£ ~ A^(0, cr^) and X ~ A^(0, S). To control the signal-to-noise ratio, we define r = ||/ 3 || 2 /ct, 
which is chosen to be 2.3 for all experiments. The sample size and the data dimension are 
chosen to be {n,p) = (200,1000) or {n,p) = (500,10000) for all experiments. For evaluation 
purposes, we consider four different structures of S below. 

(i) Independent predictors. The support is set as 5 = {1, 2, 3,4,5}. We generate Xi from 
a standard mnltivariate normal distribution with independent components. The coefficients 
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are specified as 


A = 


(-l)“»(|iV(0,1)1 + 1), Ui ~ 5er(0.5) i e S 
0 i^S. 


(ii) Compound symmetry. All predictors are equally correlated with correlation p = 0.6. 
The coefficients are set to be /S* = 3 for z = 1,..., 5 and A = 0 otherwise. 


(iii) Group structure. This example is Example 4 in Zou and Hastie (2005), for which we 
allocate the 15 true variables into three groups. Specihcally, the predictors are generated as 

Xl+3m = A + A^(0, 0.01), 

X2+3m = Z2 + A^(0,0.01), 

X3+3m = Z3 + Ar(0,0.01), 

where m = 0,1, 2,3,4 and Zi ~ A^(0,1) are independent. The coefficients are set as 

I3i = 3, z = 1, 2, ■ • ■ , 15; A = 0, z = 16, ■ • • , p. 


(iv) Factor models. This model is also considered in Meinshausen and Biihlmann (2010) 
and Cho and Fryzlewicz (2012). Let = 1,2, ■ ■ ■ , k he independent standard normal 


variables. We set predictors as Xi = Yl'j=i 4>jfij + Ii: where fij and rji are generated from 
independent standard normal distributions. The number of factors is chosen as /c = 5 in the 
simulation while the coefficients are specihed the same as in Example (ii). 


Square root of error 


False positives 


False negatives 


2 

1.5 

1 



LAT RATIasso Iasi Ias2 Enetscad mc+ 


LAT RATIasso Iasi Ias2 Enetscad mc+ 


LAT RATIasso Iasi Ias2 Enetscad mc+ 


Figure 2: The Boxplots for Example (i). Left: Estimation Error; Middle: False Positives; 
Right: False Negatives 

To compare the performance of all methods, we simulate 200 synthetic datasets for 
(rz,p) = (200,1000) and 100 for {n,p) = (500,10000) for each example, and record i) the 
root mean squared error (RMSE): ||/3 — f3\\2, h) the false negatives FN), iii) the 
false positives FP) and iv) the actual runtime (in milliseconds). We use the extended 


11 





















3.5 p 
3 - 

2.5 - 


Square root of error 




False positives 


LAT RATIassolasI Ias2 Enetscad mc+ 



False negatives 


LAT RATIassolasI tas2 Enetscad mc+ 


Figure 3: The Boxplots for Example (ii). Left: Estimation Error; Middle: False Positives; 
Right: False Negatives 
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LAT RATIassolasI Ias2 Enetscad mc+ LAT RATIassolasI Ias2 Enetscad mc+ 



talse positives 


False negatives 



Figure 4: The Boxplots for Example (hi). Left: Estimation Error; Middle: False Positives; 
Right: False Negatives 
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False negatives 


LAT RATIassolasI Ias2 Enetscad mc+ 


Figure 5: The boxplots for Example (iv). Left: Estimation Error; Middle: False Positives; 
Right: False Negatives 
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Table 1: Results for (n,p) = (500,10000) 


Example 


LAT 

RAT 

lasso 

lasLAT 

lasRAT 

enet 

scad 

mc+ 


RMSE 

0.263 

0.264 

0.781 

0.214 

0.214 

1.039 

0.762 

0.755 

Ex.(i) 

# FPs 

0.550 

0.580 

0.190 

0.190 

0.190 

0.470 

0.280 

0.280 


# FNs 

0.010 

0.010 

0.000 

0.000 

0.000 

0.000 

0.000 

0.000 


Time 

36.1 

41.8 

72.7 

72.7 

74.1 

71.8 

1107.5 

1003.2 


RMSE 

0.204 

0.204 

0.979 

0.260 

0.260 

1.363 

0.967 

0.959 

Ex. (ii) 

# FPs 

0.480 

0.480 

1.500 

0.350 

0.350 

10.820 

2.470 

2.400 


# FNs 

0.000 

0.000 

0.040 

0.040 

0.040 

0.040 

0.020 

0.020 


Time 

34.8 

40.8 

76.1 

76.1 

77.5 

82.0 

1557.6 

1456.1 


RMSE 

9.738 

1.347 

7.326 

17.621 

3.837 

1.843 

7.285 

8.462 

Ex. (hi) 

# FPs 

0.000 

0.000 

0.060 

0.000 

0.000 

0.120 

0.120 

0.090 


# FNs 

4.640 

0.000 

1.440 

13.360 

1.450 

0.000 

1.800 

2.780 


Time 

35.0 

41.6 

75.6 

75.6 

77.5 

74.4 

6304.4 

4613.8 


RMSE 

0.168 

0.168 

1.175 

0.256 

0.256 

1.780 

0.389 

0.368 

Ex. (iv) 

# FPs 

0.920 

0.920 

21.710 

0.260 

0.260 

37.210 

6.360 

6.270 


# FNs 

0.010 

0.010 

0.140 

0.140 

0.140 

0.450 

0.000 

0.000 


Time 

34.5 

41.1 

78.7 

78.7 

80.8 

81.4 

1895.6 

1937.1 


BIC (Chen and Chen, 2008) to choose the parameters for any regularized algorithm. Due to 
the huge computation expense for scad and mc+, we only hnd the hrst \^/p\ predictors on 
the solution path (because we know s ^/p)■ For RAT and LAT, d is set to 0.3 x n. For 
RAT and larsRidge, we adopt a 10-fold cross-validation procedure to tune the ridge param¬ 
eter r for a better hnite-sample performance, although the theory allows r to be hxed as a 
constant. For all hard-thresholding steps, we £x 5 = 0.5. The results for {n,p) = (200,1000) 
are plotted in Figure ill@ and and a more comprehensive result (average values for 
RMSE, ^ FPs, ^ FNs, runtime) for (n,p) = (500,10000) is summarized in Table 


As can be seen from both the plots and the tables, LAT and RAT achieve the smallest 
RMSE for Example (ii), (hi) and (iv) and are on par with lasLAT for Example (i). For 
Example (hi), RAT and enet achieve the best performance while all the other methods fail 
to work. In addition, the runtime of LAT and RAT are also competitive compared to that 
of lasso and enet. We thus conclude that LAT and RAT achieve similar or even better 
performance compared to the usual regularized methods. 


4.2 A Student Performance Dataset 


We look at one dataset used for evaluating student achievement in Portuguese schools (Cortez 


and Silva, 2008). The data attributes include student grades and school related features that 
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Table 2: Prediction Error of the Final Grades by Different Methods 


methods 

mean error 

Standard error 

average 

size 

model 

runtime 

lisec) 

(mil- 

LAT 

1.93 

0.118 

6.8 


22.3 


RAT 

1.90 

0.131 

3.5 


74.3 


lasso 

1.94 

0.138 

3.7 


60.7 


lasLAT 

2.02 

0.119 

3.6 


55.5 


lasRAT 

2.04 

0.124 

3.6 


71.3 


enet 

1.99 

0.127 

4.7 


58.5 


scad 

1.92 

0.142 

3.5 


260.6 


mc+ 

1.92 

0.143 

3.4 


246.0 


adaptive lasso 

2.01 

0.140 

3.6 


65.5 


null 

4.54 

0.151 

0 


— 


Table 3: 

Prediction Error of the Final Grades Excluding Strong Signals 


methods 

mean error 

Standard error 

average 

model 

runtime 

(mil- 




size 


lisec) 


LAT 

4.50 

0.141 

5.3 


22.4 


RAT 

4.26 

0.130 

4.0 


74.0 


lasso 

4.27 

0.151 

5.0 


318.9 


lasLAT 

4.25 

0.131 

2.9 


316.5 


lasRAT 

4.28 

0.127 

2.8 


331.9 


enet 

4.37 

0.171 

6.0 


265.6 


scad 

4.30 

0.156 

4.8 


387.5 


mc+ 

4.29 

0.156 

4.7 


340.2 


adaptive lasso 

4.24 

0.180 

4.8 


298.0 


null 

4.54 

0.151 

0 


— 



were collected by using school reports and questionnaires. The particular dataset used here 
provides the students’ performance in mathematics. The goal of the research is to predict 
the hnal grade based on all the attributes. 

The original data set contains 395 students and 32 raw attributes. The raw attributes 
are recoded as 40 attributes and form 780 features after interaction terms are added. We 
then remove features that are constant for all students. This gives 767 features for us to 
work with. To compare the performance of all methods, we hrst randomly split the dataset 
into 10 parts. We use one of the 10 parts as a test set, £t all the methods on the other 9 
parts, and then record their prediction error (root mean square error, RMSE), model size 
and runtime on the test set. We repeat this procedure until each of the 10 parts has been 
used for testing. The averaged prediction error, model size and runtime are summarized in 
Table We also report the performance of the null model which predicts the hnal grade on 
the test set using the mean hnal grade in the training set. 
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It can be seen that RAT achieves the smallest cross-validation error, followed by scad 
and mc+. In the post-feature-selection analysis, we found that two features, the 1st and 2nd 
period grades of a student, were selected by all the methods. This result coincides with the 
common perception that these two grades usually have high impact on the hnal grade. 

In addition, we may also be interested in what happens when no strong signals are 
presented. One way to do this is to remove all the features that are related to the 1st and 
2nd grades before applying the aforementioned procedures. The new result without the 
strong signals removed are summarized in Table 

Table shows a few interesting hndings. First, under this artihcial weak signal scenario, 
adaptive lasso achieves the smallest cross-validation error and RAT is the hrst runner-up. 
Second, in Stage 1, lasso seems to provide slightly more robust screening than OLS in that the 
selected features are less correlated. This might be the reason that LAT is outperformed by 
lasLAT. However, in both the strong and weak signal cases, RAT is consistently competitive 
in terms of performance. 


5 Conclusion 

We have proposed two novel algorithms Lat and Rat that only rely on least-squares type 
of htting and hard thresholding, based on a high-dimensional generalization of OLS. The 
two methods are simple, easily implementable, and can consistently £t a high dimensional 
linear model and recover its support. The performance of the two methods are competitive 
compared to existing regularization methods. It is of great interest to further extend this 
framework to other models such as generalized linear models and models for survival analysis. 
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Appendix 0: Proof of Lemma 1 

Applying the Sherman-Morrison-Woodbury formula 

{A + UDV)-^ = - A-^U{D-^ + VA-^U)-WA-\ 

we have 

r(r/p + X'^X)-^ = Ip- X^{In + ^XX^y^X^ = Ip - X^{rln + XX^-^X. 
Multiplying X'^Y on both sides, we get 

r{rlp + X^Xy^X'^Y = X^Y - X^{rln + XX^-^XX^Y. 

The right hand side can be further simplihed as 

x^Y - x^yin + xxy-^xx'^Y 

= X^Y - X^yin + XXy-\rIn + XX^ - rQY 
= X^Y - X^Y + r{rln + XX^-^Y = rX'^yy + XX^-^Y. 

Therefore, we have 

{rip + X'^Xy^X^Y = X^{ry + XXy-^Y. 


Appendix A: Proof of Theorem 

Recall the estimator = X^{XXy-^Y = X^{XXy-^XI3 + X^{XXy-y = i + r]. 

The following three lemmas will be used to bound ^ and r] respectively. 

Lemma 2. Let $ = X"’"{XX'^y^X. Assume p > Cqu for some Cq > 1, then for any C > 0 
there exists some 0 < ci < 1 < C 2 and C 3 > 0 such that for any t > 0 and any i & Q, j ^ i, 


pi |$jj| < ciK ) < 2e" 

' p 




Pfn > C2K- ^ 

pj 


( 3 ) 


and 


P 


<hj,| > C 4 Kt^^'\ < 5e 2e 

p ) 


T '\/c 2 (co—Cl) 

where C 4 = ^ , 

\/c3(co-l) 


( 4 ) 


18 





The proof can be found in the Lemma 4 and 5 in Wang and Leng (2015) for elliptical 


distributions. The special case of Gaussian is also proved in the Lemma 3 of Wang et ah 


(2015). Notice that the eigenvalue assumption in Wang and Leng (2015) is not used for 


proving Lemma 4 and 5. 

Lemma 3. Assume Xi follows EN{L,H). If E[L~‘^] < Mi for some constant Mi > 0, 
var{e) = cr^ and logp = o{n), then for any 0 < a < 1 we have 


P 




where r* is defined as the minimum value for the important signals and n = condiji). 

To prove Lemma we need the following two propositions. 

Proposition 1. (Lounici, 2008 Lounici (2008); Nemirovski, 2000 Akritas et al. (2014)) Let 


Yi G lU’ he random vectors with zero means and finite variances. Then we have for any k 
norm with k G [2, oo] and p > 3, we have 




i=l 


^ < G min{fc, logp} 

i=l 


( 5 ) 


where C is some absolute constant. 

As each row of X can be represented as X = where L = diag{y/pLi/\\zi\\ 2 , ■ ■ ■ , y/pLn/\\zn 

and Z is a matrix of independent Gaussian entries, i.e., Z ~ iV(0,/p). For Z, we have the 
following result. 

Proposition 2. Let Z ~ N{0,lp), then we have the minimum eigenvalue of ZZ'^jp satisfies 
that 

p(Xrmn{ZZ^/p) > (1 - - - > 1 - 2exp(-tV2) 

V p p J 

for any t > 0. Assume p > Cqu for Cq > 1 and take t = s/n. When n > 4cq/(co — 1)^, we 
have 


P 


(^Xrain{Z Z'^ / p) 



> 1 -2exp(-n/2), 


( 6 ) 


where c 


4c2 


The proof follows Gorollary 5.35 in Vershynin (2010). 


Proof of Lemma [H Let A = pX^(XX^)-M and Z = Then p = p-^AP-^e. 
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Part 1. Bounding \Aij\. Consider the standard SVD on Z as Z = where V 

and D are n x n matrices and U is a. p x n matrix. Because Z is a matrix of iid Gaussian 
variables, its distribution is invariant under both left and right orthogonal transformation. 
In particular, for any T G 0{n), we have 

TVDU^ = VDU^, 

i.e., V is uniformly distributed on 0{n) conditional on U and D (they are in fact independent, 
but we don’t need such a strong condition). Therefore, we have 

A = = pE^Z^L(LZEZ^L)-^L = pE^UDV^L(LVDU^EUDV^L)-^L 

= pElu(U^EU)-^D-^V^ = ^eIu(U^EU)-A—)~W^. 

WP 

Because V is uniformly distributed conditional on U and D, the distribution of A is also 
invariant under right orthogonal transformation conditional on U and D, i.e., for any T G 
0{n), we have 


A = AT. 


( 7 ) 


Our first goal is to bound the magnitude of individual entries Aij. Let Vi = ejAA'^Ci, which 
is a function of U and D (see below). From ([^, we know that ejA is uniformly distributed 
on the sphere S'"'“^(yTi) if conditional on Vi (i.e., conditional on U,D), which implies that 



(£) 





( 8 ) 


where x'jS are iid standard Gaussian variables. Thus, Aij can be bounded easily if we can 
bound Vi- Notice that for Vi we have 


Vi = ejAA'^a = pejE^U{U^TU)-A—) \u^EU)-^U'^Ehi. 

p 

= pejH{U^TU)-^—)~\u^TU)-"^H^ei 

p 

< pejHH-^ei ■ ■ KLi^) 


Here H = T^^UiJJ'^TU) is defined the same as in Wang and Leng (2015) and can be 
bounded as ejHH^Ci < C 2 nK/p with probability 1 — 2exp(—Cn) (see the proof of Lemma 3 
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in Wang et al. (2015)). Therefore, we have 

p(^Vi < > 1 - 2 exp(-C'n) 


Now applying the tail bound and the concentration inequality to ([^ we have for any t > 0 
and any C > 0 


P{\xj\ >t)< 2exp(-fV2) 




n 


< C 3 ) < exp(—Cn). 


( 9 ) 


Putting the pieces all together, we have for any f > 0 and any C > 0 that 

p( max|Ay| < > 1 - 2npexp(-t^/2) - 3pexp(-C'n). 

\ ij y C3 p / 

Now according to 0 . we can further bound Xmin{D‘^/p) and obtain that 


P 


^max \Aij\ < ^> 1 — 2 npexp(— 1 ^/ 2 ) — 3pexp(—Cn) — 2exp(—n/2). (10) 


Part 2. Bounding p he second step is to use (10) and Proposition to bound p. The 
procedure follows similarly as in Lounici’s paper. We hrst note that \\Zi\\ \ follows a chi-square 
distribution X‘^{p). We have for any t 


pj ikJ i ,0 
P 


>l + 2i / —I -) <e , 

p p 


from which we know 


P^maxp ^||^j ||2 < 5/2^ > 1 — pe 


( 11 ) 


Nowdehne Wj = {Aijp~^/‘^\\zj\\ 2 Lz^ej, A 2 jP~^/‘^\\zj\\ 2 Lz^ej, ■ ■ ■ , Apjp~^/‘^\\zj\\ 2 L~^ej). It’s 
clear that p = ^j/P- Applying Proposition to hPjs with the l^o norm and noticing 

tht Lj is independent of Zj we have 


7C2 


^11 ^ ^ogp—a‘^KH'^^E[LA] < —a^KH^M^n\ogp. 

j=i j=i j=i 
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Using the Markov inequality on rj, we have for any r > 0 


P 


P 




< 


oo > r < 

'n . / 

7C 2 (j‘^\ogp 


p^EMi _ -Ell e;=i Wi 


12 
I oo 


nr^ 


nr^ 


cc^r^ 


To match our previous result, we take r = ci^/nPn ^/6 and t = rS^ for some small a, 

342020-^^"^Ml logp 




6 p 


> 1-0 


cfccsT’^^ 
a'^K^ logp\ 

T*^n'^ J 


- 2npexp(-n^“"/2) - 3pexp(-C'n) - 2exp(-n/2) 


□ 


Lemma 4. Assume var{Y) < Mq. Define <h = X'^{XX'^) ^X. If p > cqu for some Cq > 1, 
then we have for any t > 0 


p| max^ > C 4 \/MokH^^ j < 2pe + 5pe 

j+i- ^ 


-Cn 


where C 4 , n are defined in Lemma\^ 


Proof. Following Wang and Leng (2015); Wang et ah (2015), we dehne H = X^(XX^) 2 . 


When X ~ A^(0, E), M follows the MACG{TI) distribution as indicated in Lemma 3 in Wang 


et al. (2015) and Theorem 1 in Wang and Leng (2015). For simplicity, we only consider a 


particular case where i = 1 . 

For vector v with vi = 0, we dehne v' = { 02 , 03 , ■ ■ ■ ,OpY' and we can always identify a 
(p — 1) X (p — 1) orthogonal matrix T' such that T'o' = ||n'||2e'i where e'l is a (p — 1) x 1 unit 
vector with the hrst coordinate being 1. Now we dehne a new orthogonal matrix T as 


T = 


1 0 
,0 TL 


and we have 


To = 


1 0 
.0 T' 


0 

klheU 


= ||'y|| 2 e 2 - and = Ci 


1 0 
0 


= e. 
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Therefore, we have 


elHH^v = elT^THH'^'T'^'Tv = e[T'^ Cs = \\v\\2e{HH^'e2. 


jTrpT. 


TrpT TJ ZjTrjiT, 


.T £V £VT, 


Since H follows MACG(T,), H = T^H follows MACG{T'^T,T) for any hxed T. Therefore, 
we can apply Lemma again to obtain that 


n\ -I • TtjtjT ~ ~ " " Gn 


P[ \eiX^{XX^)-^Xv\ > \\v\\ 2 CiKtX-j =P\^\eiHH^v\ > \\v\\ 2 C 4 Kt^ 


= P{ ||n||2|ef TTiL 62! > \\v\\ 2 C 4 KtA— j = P^||n||2|<hi2| > \\v\\ 2 C 4 Kt^ 


= P{ |$i2| > C 4 Kt^^ ) < 5e + 2e * 
P 


Applying the above resnlt to n = (0,/3i we have 


< C4nt\\Ph 


n 




P 


with probability at least 1 — 5e — 2e 

In addition, we know that var{Y) = < Mq and thns 


m\2 < 


Conseqnently, we have 



Now we are ready to prove Theorem 

Proof of Theorem\^ Recall the dehnition of ^ as ^ = X'^{XX'^)~^X[ 3 . For any i we 
have 


e. = eJX^{XX^)-^Xf3 = Y + Y 

j&S j^i 
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For the first term, we have 


n 


min/3j| > ciK Vi G S 

n p 


with probability 1 - \S*\e-^^ and 


n 


min/3j| < ciKT^— Vi G 5* 

a p 


with probability 1 — |S'*|e Now, for the second term, using LemmaW we have 


^ ^ Vi = l,2,---,p 




6 


with probability at least 1 — 2pexp{— ] 2 c^Mo ~ Therefore, we have for any i ^ S 

1^1 CiK~^T* n 5 ciK~^t* n 

>CiK T ---> ---. 

p bp bp 


and for i G 5* we have 

n ciK~^T* n 7 cik~^t* n 

6 <c,Kn- + —--< —--, 

p bp 12 p 

where we use the assumption that r* > Now combining the result from Lemmawe 

can obtain 




„, I - , 2ci«: n\ ^. 

' i&s* 3 p J \ T*^n°‘ ) 


logp\ 

I ’ 


and 


PI > l-ofVdLsPy 

' ies. 12 p J \ r*^n" / 


Taking 7 = ^ np, we have 


P( mm|A|> 7 >max|ft|j 


□ 
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Proof of Theorem [2l and |3| 


For the selected submodel Md, we define to be the variables contained in Md and 
to be variables that are excluded from A4d- It is clear that 


^(OLS) ^ (xrXi)-'XlY = & + {X'iXar'Xle + = & + ,u + 


To prove Theorem is essentially to bound rj and u. Thus, we need following three lemmas. 


Lemma 5 (Garvesh, Wainwright and Yu. (2010) Raskutti et ah (2010)). Assume Z 


X(0, S). There exists some absolute constant c', c" > 0 such that 


'n 4 V n 


with probability at least 1 — c"exp(—c'n), where p(S) = maxj=i^2,---,p 

In our case, for any v with d nonzero coordinates, we have ||n||i < V^ll'^lb? p(S) 
and > A^ij^(S)||n||2. Therefore, 


= 1 


\\Zvh > _ g^/ dlogp ^ 


l^lh, ||^'||o<c(• 


Thus, as long as n > O^^dlogp, we have 




\L(S) 


\M\<d 


Lemma 6. Assume E[L < Mi and e[L^^] < M2. For any Ai such that S* C Ad and 
\-M\ < d, we have for any a > 0 


V \M\<d 

where A* = Amin(S). 


p( max llr^rflloo < 

V \M\<d V n“ 


-1 c{ I ^1 + ^2 


77,3 




Proof. Define A = (XjXd) ^Xj, we have 


Ti = (XjXdr^Xje = Ae 
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For A, we can bound its entries as 


max lAjjl < max |ef(XjXd) ^Xje^l < max ||e,^’(XjXd) ^||i||Xje 


T/vT V ^-l| 








] II oo 


< ydmax||ef(XjXrf) ^amaxlXjl < —max|Xj|. 
ij ij n \ n J tj 

Recall that X = where L = diag{y/pLi/\\zi\\2T ■ ■ ■ , y/pLn/\\zn\\2) and thus X^ 

possesses a representation as X^ = LZ'E^J'^, where is an p x d matrix formed by the 
selected d columns of We can now further bound A“^„| I as 


1 [eIz^L^LZeI 

'^min 1 „ I ^min 


n 


n 


< ( A^i„(FZ)A^i„(s|z^Zs|/n) 


-1 


Using Lemma it is clear that 


min X^,^{i:lz^ZT.l/n) > 


\M\<d 


64 “ 64 ’ 


with probability at least 1 — 0 {e In addition, since E[L < Mi and E[L^‘^] < M2, 
we have for any fci > 0 , ^2 > 0 


Mo 


pl 2 ' 


P{L^ < ki) < klMi and P(L > ^ 2 ) < 
Combining with equation ([TT| implies that 


A„i„(L^L) > 

5 

with probability at least 1 — — nkfMi. Therefore, we have 


,1 fXjXA ^ 162 

max 

\M\<d \ n J X^ki 


with probability 1 - 0 {nklMi). 


For maxjj |Xj|, we just need to bound maxjj X^. Using the representation X = LZY}I‘^, 
we know that 


X,, = '^Z,Y}l‘^e,. 
P 2 
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It is easy to see that is a Gaussian random variable with mean zero and variance 

1, thus for any t > 0 


P{\ZiY}l’^ej\ >t)< 


In addition, ||2:i||i/p follows a X^{p) and we have 


Taking t = p/4, we have max* i|-^i|| 2 /\/p ^ 1/2 with probability at least 1 — and thus 


F(max \Xij\ < 4^2 Vlogp) > 1 
Combining all pieces of results, we obtain that 


M 2 n 

pl2 

1^2 


2p ^ — ne 


p( mm ma^lAyl < 648A^i/VI^t > i.o(uk\M, + ^ 

V|-M|<d U A*fcin J \ kp 


Following a similar argument in proving Lemma we dehne Wj = {Aijej, ‘ ‘ ‘ > 

and then 

n 

i=i 


Using Proposition!^ we have 

n n 

EWvWl = -Bll 5 ^ lUIlL < C\ogdY,E\\W,\\l < o 


i=i 


i=i 


dlogdlogpA 

Xlkl n J 


Using the Markov inequality implies that for any r > 0 


P I max I 

V \M\<<i 


> r ] < 


— j ,2 


= 0 


a‘^k\ dlogdlogpA 


X^kfr'^ n J 


+ 0 nk\Mi + 


nM 2 

Ul 2 

^2 


Let r = ki = n 's' and k 2 = n s , we have 


d/ II II / /logP^ 1 ^(Xjdlogd M1 + M2 
P\ max m 00 < o'\ =1 — 0 —^— I r- 


□ 


Lemma 7. Assume E[L < Mi and e[L^^] < M 2 . For any Ai such that S* C jCi and 
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|7W| < d. Assume that c? — [S'*! < c and some l G (0,1), then for any 

a > 0, we have 


P 


max \\w\\ 2 <a\f^'] >1-0 

\M\<d \ ) 


{M^ + M2)R^ \ 


Proof. According to our definition that u = {XjXd) ^XjXd,cPd,c, we can directly bound 
the I 2 norm of uj as 



where A 


-1 / 


min \ n 


has already obtained a bound in Lemma 


as 


,1 [XjXA ^ 162 

max 

\M\<d \ n J A*/ei 


with probability 1 — 0{nk\Mi). Now for XfjJ^Xj^Xd^ct^d,c we have 




n 


n 


n 


I \\z. 


*112 


Since Z ~ A^(0, Ip), we can choose an orthogonal matrix Q such that (Id,cZ^]i^c ~ ^iQ\\l^d,(Xd,c\\‘i 
and 


al/2| 


llfcsiAl2e,Z*'Zer < \\M\lre,Z-^Zeu 


where Z ~ A^(0, Ip). It is easy to see that for any f > 0 


e\Z^Zei t 2t . , 

P{ ^^ < 1 + 2W - + - ) > 1 - e-* 
n \ n n 


and ||/3(i,c||2 — X Thus, taking t = (1 + c) logp, we have 


max < Srt^Ry 

\M\<dn 


with probability 1 — p ^ as long as n > (1 + c) logp where c is the upper bound on d — [S'* 
For maxjpL^/ll^jlll, we follow the same argument in Lemma 




* \\Z. 


*112 


pl2 • 
1^2 















Putting all pieces together, we have 


max 

\M\<d 



with probability at least 1 — 0 



According to our assumption that r* < 


-\/and taking ki = 




and ko 


1 / y/ki we have 


P 


II II . Pogp\ 

max tc 2 < o"\/- > 1 — h' 

\M\<d y n‘^ J 


(log p) 2^77,3-40-21, J 


□ 


We are now ready to prove Theorem 

Proof of Theorem\^ We just need to combine the results of Lemmaandi.e., 

= Pd + V + ^, 


where 


P 


max llr^lloo < cr 

\M\<d 



k X^'^dlogd Mi + M2\ 

V 77,3^^”“^ 77,3P“^“) ) 


and 


P 


II II . Pogp\ 

max tc 2 < <^\ - > 1 — L' 

\M\<d y J 


(logp)2''n^“^"“^W 


Therefore, we have 


P 


( 


max 11/3^ 

V |,M|<rf,S*CA4 


{OLS) 

d 


/3d\\oo < ‘^cr 



/ X-^d log d Ml + M 2 (Ml + M 2 )R^ 

V (logp)2^n3-4«-2t 


□ 


Proof of Theorem Recall that Xd consists of variables contained in the dehnition 
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of becomes 

= (XjX, + rh)-^XjXd(3 + (XjX, + rhY^Xje + (XjX, + rh)-^XjX^^Yd,c 
= (3- r(XjX, + r/,)-V + (XjX, + rQ-^Xje + (XjX, + rI,)-^XjXa,A,c 

= l3 - YY + vY) + i^Y)- 


For ^(r) we have 

\m\\l<rY^{XjX, + rI,)-^(3< 

As proved in Lemma we know that 


rY/3Y 


< 




n^^mm{XlXd/n + r/n) n 


. I XjXA ^ Kh 

max A™,;„ - > 


\M\<d 


n 


162 


with probability 1 — 0{nk\Mi). Adding r/n to the above matrix will only increase the 
smallest eigenvalue. Thus, we have 

Where we used Mq > var{Y) > ||/^|| 2 -^maa:(^)- Choosing ki = n~ Y \ we have 


Ml 162rKMoA Mi 

FI rnax ||^(r )||2 < i I =1-0 


. „Ar)||2< M 


ns 


i(l-4a) F 


which implies that as long as r < 


162/tAfo 


we have 


p( max ||f(r )||2 < a 
\ \M\<d 


/logp\ 

V n“ / 


= 1-0 
/ vns 


Ml 


^ (1—4a) 


In addition, the proof for ||?7||oo and ||a ;||2 shows that the only key quantity that has changed 
is maxi^K^ which is replaced by maX|^|<^ A^i„ (for While 


^\M\<d^niiny „ J in icpiaucu ^ fo 

the latter is trivially lower bounded by the former, we thus have 


F 


max ||?7(r)||oo < cr 

\M\<d 



/A^^dlogd Mi + M2\ 

V nsd"®) 77 , 3 b“ 4 o) J 
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and 


P 


max ||w(r )||2 < >1 — 0 

\M\<d ^ n°‘ J 


{M^ + M2)R^ \ 


Consequently, we have 


P 


max 

\M\<d,S*(lM 


/^dlloo < ^O- 



f x:^d\ogd 2M1 + M2 {m, + M 2 )r^ 


as long as 


Wi8)^iogp 
162 kMq 


□ 


Proof of Corollary As mentioned before, we have = 

Because Si ~ A^(0, a^) for z = 1, 2, • • • , n, we have for any z G Ad 


’Z = <(AV A'^. 


)-‘aL/ ~ 


Af(0,CT^ 









'^nAj,/^,)-‘ei]V(0.1), 


( 12 ) 


Likewise in the proof of Lemma ??, we know that as long as n > hd^dlogp 




1 

64k 


Thus, we have 


maxef(Xj^ < 64K/rz. 

ieXd ^ 


Therefore, for any f > 0 and z G Ad^, with probability at least 1 — c"exp(—c'n) — 
2exp(—1^/2) we have 


m < at^efiX^^ 


X 


Md 




Ci < 


8K‘2at 


n 


Then for any 5 > 0, if n > log(2c"/(5)/c', then with probability at least 1 — h we have 


max \r]i\ < 8cr 
ieMd 


2Klog(4d/S) 


n 


(13) 


Because a needs to estimated from the data, we need to obtain a bound as well. Notice that 
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(T^ is an unbiased estimator for a, and 


= aV(I„ - 


a^X^(n - d) 
n — d 


where X‘^{k) denotes a chi-square random variable with degree of freedom k. Using Propo¬ 


sition 5.16 in Vershynin (2010), we can bound as follows. Let K = ||A’^(1) — lH^^. There 
exists some C 5 > 0 such that for any f > 0 we have, 


P 


— d) 


n — d 


- 1 


> f ) < 2 exp < — C 5 min 


P{n — d) t{n — d) 


iP2 


K 


Hence for any 5 > 0, if n > d + AK"^ log( 2 /h)/c 5 , then with probability at least 1 — h we have. 


cr 


a 


< c^V2, 


which implies that 

-a < a < -a . 
2 - - 2 

Then we know that 


max \ fii\ < 8a 
i&Md 


2K\og{Ad/5) 


n 


< 8V2: 


a 


2K\og{Ad/S) 


n 


< 8^3, 


a 


2K\og{4d/6) 


n 


Now define 7 ' = 8 y/2a _ jf signal r = minjg 5 \/3i\ satisfies that 


r > 24cr 


2K\og{Ad/5) 


n 


then with probability at least 1 — 25, for any i ^ S 


2,K\og(Ad/6) , 

m = |7*l <8aV-< 7 . 


n 


and for i G S' we have 


i«i^ /2Klog(4ci/5) , 

|Pi| P T — max \r]i\ > 16a\l -> 7 . 




n 


□ 
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Proof of Theorem 4 


The result of Theorem 4 can be immediately implied from Theorem Bii 
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